Data Science Project¶

In this project, I explored an educational dataset and built a classification machine learning model. I progressed task by task, and the requirements of each task were explained along the way.

In the first part, I explored different tables using functions such as (head, shape, info, and describe). I dealt with duplicate records and missing values, and performed some exploratory tasks.

In the second part, I built a classification model and reported its accuracy.

About the Dataset¶

You will work on a public educational dataset shared by the UK Open University. Although this document provides sufficient information about this public dataset, you are strongly recommended to refer to https://analyse.kmi.open.ac.uk/open_dataset for more detailed information.

There is also a Kaggle page where you can see analyses shared publicly by others. These can be helpful if you want to explore the dataset beyond this assignment. https://www.kaggle.com/datasets/rocki37/open-university-learning-analytics-dataset

1. Exploratory Analysis¶

1.1. Courses Table¶

Courses table (courses.csv) contains the list of all available modules and their presentations.

The columns are:

  • code_module – code name of the module, which serves as the identifier.
  • code_presentation – code name of the presentation. It consists of the year and “B” for the presentation starting in February and “J” for the presentation starting in October.
  • length - length of the module-presentation in days.

The structure of B and J presentations may differ, and therefore it is good practice to analyse the B and J presentations separately. Nevertheless, for some presentations the corresponding previous B/J presentation does not exist, and therefore the J presentation must be used to inform the B presentation, or vice versa. In the dataset this is the case for the CCC, EEE and GGG modules.
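Before deciding which presentations need such a fallback, it helps to list which modules never ran a B (or J) presentation. A minimal sketch on a toy frame shaped like courses.csv (module codes and presentations are illustrative, not the full dataset):

```python
import pandas as pd

# Toy frame shaped like courses.csv (columns assumed from the description above)
courses = pd.DataFrame({
    "code_module": ["AAA", "AAA", "CCC", "CCC"],
    "code_presentation": ["2013J", "2014J", "2014B", "2014J"],
})

# The last character of code_presentation encodes the start month (B=Feb, J=Oct)
starts = courses.assign(start=courses["code_presentation"].str[-1])
per_module = starts.groupby("code_module")["start"].agg(set)

# Modules that never ran a February (B) presentation
only_j = per_module[per_module.apply(lambda s: "B" not in s)].index.tolist()
print(only_j)  # ['AAA']
```

The same filter with `"J" not in s` finds modules that only ran in February.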

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import math
import seaborn as sns
from IPython.display import Image
import plotly.io as pio

TASK1: Identify and treat duplicate/missing values (if there is any).

In [2]:
courses_data = pd.read_csv("courses.csv")
print(courses_data.head())
  code_module code_presentation  module_presentation_length
0         AAA             2013J                         268
1         AAA             2014J                         269
2         BBB             2013J                         268
3         BBB             2014J                         262
4         BBB             2013B                         240
In [3]:
courses_data.isnull().sum()
Out[3]:
code_module                   0
code_presentation             0
module_presentation_length    0
dtype: int64

courses_data has no missing (NA) values.

In [4]:
courses_data.duplicated().sum()
Out[4]:
0

courses_data contains no duplicate rows.

TASK2: Find out how many courses started in February vs October, and compare their length. Interpret the results.

In [5]:
courses_data['start_month'] = courses_data['code_presentation'].str[-1]
feb_courses = courses_data[courses_data['start_month'] == 'B']
oct_courses = courses_data[courses_data['start_month'] == 'J']
total_feb_courses = feb_courses['module_presentation_length'].sum()
total_oct_courses = oct_courses['module_presentation_length'].sum()
avg_length_feb = feb_courses['module_presentation_length'].mean()
avg_length_oct = oct_courses['module_presentation_length'].mean()

print(f"Number of courses starting in February: {len(feb_courses)}, their total length: {total_feb_courses}")
print(f"Number of courses starting in October: {len(oct_courses)}, their total length: {total_oct_courses}")
print(f"Average length of courses starting in February: {avg_length_feb:.2f} days")
print(f"Average length of courses starting in October: {avg_length_oct:.2f} days")
Number of courses starting in February: 9, their total length: 2159
Number of courses starting in October: 13, their total length: 3463
Average length of courses starting in February: 239.89 days
Average length of courses starting in October: 266.38 days

On average, October (J) presentations are longer than February (B) presentations (266.38 vs. 239.89 days).
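The same per-month summary can be computed in a single pass with groupby. A minimal sketch on a toy frame shaped like courses_data (lengths are illustrative, not the real values):

```python
import pandas as pd

# Toy courses table; module_presentation_length values are illustrative
courses = pd.DataFrame({
    "code_presentation": ["2013B", "2013J", "2014B", "2014J"],
    "module_presentation_length": [240, 268, 234, 262],
})

# Group by the start-month letter (last character of code_presentation)
summary = (courses
           .groupby(courses["code_presentation"].str[-1])["module_presentation_length"]
           .agg(["count", "sum", "mean"]))
print(summary)
```

This replaces the separate filter/sum/mean steps with one aggregation table indexed by "B" and "J".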

1.2. Student Info Table¶

The studentInfo table (studentInfo.csv) contains demographic information about the students together with their final results. The file contains the following columns:

  • code_module – an identification code for a module on which the student is registered.
  • code_presentation - the identification code of the presentation during which the student is registered on the module.
  • id_student – a unique identification number for the student.
  • gender – the student’s gender.
  • region – identifies the geographic region, where the student lived while taking the module-presentation.
  • highest_education – highest student education level on entry to the module presentation.
  • imd_band – specifies the Index of Multiple Deprivation band of the place where the student lived during the module-presentation.
  • age_band – band of the student’s age.
  • num_of_prev_attempts – the number of times the student has attempted this module.
  • studied_credits – the total number of credits for the modules the student is currently studying.
  • disability – indicates whether the student has declared a disability.
  • final_result – student’s final result in the module-presentation.
In [6]:
studentInfo_data = pd.read_csv("studentInfo.csv")
print(studentInfo_data.head())
  code_module code_presentation  id_student gender                region  \
0         AAA             2013J       11391      M   East Anglian Region   
1         AAA             2013J       28400      F              Scotland   
2         AAA             2013J       30268      F  North Western Region   
3         AAA             2013J       31604      F     South East Region   
4         AAA             2013J       32885      F  West Midlands Region   

       highest_education imd_band age_band  num_of_prev_attempts  \
0       HE Qualification  90-100%     55<=                     0   
1       HE Qualification   20-30%    35-55                     0   
2  A Level or Equivalent   30-40%    35-55                     0   
3  A Level or Equivalent   50-60%    35-55                     0   
4     Lower Than A Level   50-60%     0-35                     0   

   studied_credits disability final_result  
0              240          N         Pass  
1               60          N         Pass  
2               60          Y    Withdrawn  
3               60          N         Pass  
4               60          N         Pass  

TASK1: Identify and treat duplicate/missing values (if there is any)

In [7]:
studentInfo_data.duplicated().sum()
Out[7]:
0

There are no duplicate rows.

In [8]:
studentInfo_data.isnull().sum()
Out[8]:
code_module                0
code_presentation          0
id_student                 0
gender                     0
region                     0
highest_education          0
imd_band                1111
age_band                   0
num_of_prev_attempts       0
studied_credits            0
disability                 0
final_result               0
dtype: int64
In [9]:
studentInfo_data.shape
Out[9]:
(32593, 12)

The data has 1111 missing values in imd_band. Dropping the imd_band column entirely would lose information for our analysis, and the data has 32593 rows, which is much larger than 1111. For this reason, we remove the rows containing missing values instead.

In [10]:
studentInfo_data.dropna(inplace=True)
In [11]:
studentInfo_data.isnull().sum()
Out[11]:
code_module             0
code_presentation       0
id_student              0
gender                  0
region                  0
highest_education       0
imd_band                0
age_band                0
num_of_prev_attempts    0
studied_credits         0
disability              0
final_result            0
dtype: int64
In [12]:
studentInfo_data.shape
Out[12]:
(31482, 12)

This is the new shape of the data after dropping those rows.
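Dropping rows is not the only option: since imd_band is categorical, the missing values could instead be imputed, for example with the most frequent band within each region. A minimal sketch on a toy frame (region names and bands are illustrative; this assumes every region has at least one non-missing band):

```python
import pandas as pd

# Toy studentInfo-like frame with one missing imd_band
df = pd.DataFrame({
    "region": ["Scotland", "Scotland", "Scotland", "Wales"],
    "imd_band": ["20-30%", "20-30%", None, "50-60%"],
})

# Fill each region's missing imd_band with that region's most frequent band
df["imd_band"] = (df.groupby("region")["imd_band"]
                    .transform(lambda s: s.fillna(s.mode().iloc[0])))
print(df["imd_band"].tolist())  # ['20-30%', '20-30%', '20-30%', '50-60%']
```

Whether dropping or imputing is preferable depends on how much the 1111 affected rows differ from the rest; here dropping them is a defensible choice given their small share.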

TASK2: Treating categorical variables.

For this table, besides fixing any potential issue about duplicate/missing values, you are expected to explore the categorical variables: such as highest_education, imd_band, and age_band.

In particular, you may want to check whether some categories of the highest_education, imd_band, and age_band variables (e.g., Post Graduate Qualification in highest_education) contain few instances. In such cases, you may need to merge the minority categories with a major category, or even decide to create a new, smaller set of categories based on the existing ones. In some cases, you may simply want to reduce the number of categories (if you think there are too many).

As long as you provide the rationale, you can decide such details by yourself. You should work on AT LEAST TWO categorical variables in this task.

In [13]:
studentInfo_data.describe(include='all').loc[['unique'],:]
Out[13]:
code_module code_presentation id_student gender region highest_education imd_band age_band num_of_prev_attempts studied_credits disability final_result
unique 7 4 NaN 2 13 5 10 3 NaN NaN 2 4

The categorical variables are code_module, code_presentation, gender, region, highest_education, imd_band, age_band, disability, and final_result.

highest_education

In [14]:
px.bar(studentInfo_data['highest_education'].value_counts())

The 'No Formal quals' and 'Post Graduate Qualification' categories have very few instances compared to the others and could be misleading, so I merge them into 'A Level or Equivalent'.

In [15]:
studentInfo_data['highest_education'] = studentInfo_data['highest_education'].replace({
'Post Graduate Qualification':'A Level or Equivalent',
'No Formal quals':'A Level or Equivalent'
})
In [16]:
px.bar(studentInfo_data['highest_education'].value_counts())

imd_band

In [17]:
px.bar(studentInfo_data['imd_band'].value_counts())

The '10-20' band is missing its percent sign, so it appears as a separate category from the other bands.

In [18]:
studentInfo_data['imd_band'] = studentInfo_data['imd_band'].replace({'10-20':'10-20%'})
px.bar(studentInfo_data['imd_band'].value_counts())

The counts of consecutive bands are very close to each other, so I merge adjacent bands to reduce the number of categories to 5, each representing a 20-percent range.

In [19]:
studentInfo_data['imd_band'] = studentInfo_data['imd_band'].replace({
'0-10%':"0-20%",'10-20%':"0-20%",
'20-30%':"20-40%",'30-40%':"20-40%",
'40-50%':"40-60%",'50-60%':"40-60%",
'60-70%':"60-80%",'70-80%':"60-80%",
'80-90%':"80-100%",'90-100%':"80-100%"
})
In [20]:
px.bar(studentInfo_data['imd_band'].value_counts())

age_band

In [21]:
px.bar(studentInfo_data['age_band'].value_counts())

The proportion of people over 55 is very low compared to the other age groups, so age_band is reduced to two categories: 35 and over ('35<=') and under 35 ('35>').

In [22]:
studentInfo_data['age_band'] = studentInfo_data['age_band'].replace({
'55<=':'35<=',
'35-55':"35<=",
'0-35':"35>"
})
In [23]:
px.bar(studentInfo_data['age_band'].value_counts())

TASK3: Demographic Information and Performance Levels

More importantly for this table you are expected to explore various relationships between final_result and at least three categorical variable (e.g., did students with HE qualification perform better, did students with low imd_band withdraw more often, or did geographic region play any role? etc.). For this purpose you can visualize data and compute some basic statistics.

You must use at least two different chart types (e.g., bar or line or pie) to illustrate how the success/failure rates differ between different categories (e.g., education level, regions, imd_band, age, etc.). At least in one case, the chart should also denote the gender to illustrate the possible interaction between gender and the other categorical variable (e.g., do european females perform better than asian males -just an example :)).

In [24]:
px.sunburst(
data_frame=studentInfo_data,
path=['gender', 'highest_education', 'age_band'],
values=studentInfo_data.final_result.factorize()[0],
width=1000
)

Looking at the chart, many observations can be made; for example, across all age groups the count of men with an HE Qualification is higher than that of women.

In [26]:
sns.catplot(data=studentInfo_data.loc[:,['gender', 'age_band', 'region',
'final_result']], x='final_result',
col='gender', row='region', hue='age_band',
hue_order=['35>', '35<='], kind='count')
C:\Users\sidar\anaconda3\Lib\site-packages\seaborn\axisgrid.py:118: UserWarning:

The figure layout has changed to tight

Out[26]:
<seaborn.axisgrid.FacetGrid at 0x1aa30407b90>

Looking at the charts, many observations can be made; for example, in the Scotland region, men over 35 are more successful than women over 35.
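Beyond the charts, the "basic statistics" the task asks for can be obtained with a normalized crosstab, which turns raw counts into outcome rates per category. A minimal sketch on toy data shaped like studentInfo_data (the counts are illustrative, not the real rates):

```python
import pandas as pd

# Toy slice of studentInfo-like data
df = pd.DataFrame({
    "highest_education": ["HE Qualification"] * 4 + ["Lower Than A Level"] * 4,
    "final_result": ["Pass", "Pass", "Fail", "Pass",
                     "Pass", "Fail", "Fail", "Withdrawn"],
})

# normalize="index" converts each row to proportions that sum to 1
rates = pd.crosstab(df["highest_education"], df["final_result"], normalize="index")
print(rates.round(2))
```

Applied to the real table, each row shows what fraction of students at a given education level passed, failed, or withdrew, which complements the count-based plots above.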

1.3. Registration Table¶

Registration table (studentRegistration.csv) contains information about the time when the student registered for the module presentation. For students who unregistered the date of unregistration is also recorded. File contains five columns:

  • code_module – an identification code for a module.
  • code_presentation - the identification code of the presentation.
  • id_student – a unique identification number for the student.
  • date_registration – the date of student’s registration on the module presentation, this is the number of days measured relative to the start of the module-presentation (e.g. the negative value -30 means that the student registered to module presentation 30 days before it started).
  • date_unregistration – date of student unregistration from the module presentation, this is the number of days measured relative to the start of the module-presentation. Students, who completed the course have this field empty. Students who unregistered have Withdrawal as the value of the final_result column in the studentInfo.csv file.

TASK1: As the first task, you need to ensure that there are no conflicts between studentRegistration.csv and studentInfo.csv dataset in terms of Withdrawal status of unregistered students. For example, if a student unregistered from a course at some point (which can be found in "studentRegistration.csv"), his/her final_result should be Withdrawal.
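Before overwriting anything, it is worth counting how many rows actually conflict, i.e., have an unregistration date but a final_result other than Withdrawn. A minimal sketch on a toy merged frame (values illustrative):

```python
import pandas as pd

# Toy merged view of studentInfo + studentRegistration; row 3 conflicts
merged = pd.DataFrame({
    "id_student": [1, 2, 3],
    "date_unregistration": [None, 12.0, 30.0],
    "final_result": ["Pass", "Withdrawn", "Pass"],
})

# Unregistered (date present) but final_result is not 'Withdrawn'
conflicts = merged[merged["date_unregistration"].notna()
                   & (merged["final_result"] != "Withdrawn")]
print(len(conflicts))  # 1
```

If this count is nonzero on the real data, the fix applied below (forcing final_result to 'Withdrawn' wherever date_unregistration is present) resolves the inconsistency.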

In [27]:
regdata = pd.read_csv("studentRegistration.csv")
In [28]:
print(regdata.head())
  code_module code_presentation  id_student  date_registration  \
0         AAA             2013J       11391             -159.0   
1         AAA             2013J       28400              -53.0   
2         AAA             2013J       30268              -92.0   
3         AAA             2013J       31604              -52.0   
4         AAA             2013J       32885             -176.0   

   date_unregistration  
0                  NaN  
1                  NaN  
2                 12.0  
3                  NaN  
4                  NaN  
In [29]:
studentInfo_data.head()
Out[29]:
code_module code_presentation id_student gender region highest_education imd_band age_band num_of_prev_attempts studied_credits disability final_result
0 AAA 2013J 11391 M East Anglian Region HE Qualification 80-100% 35<= 0 240 N Pass
1 AAA 2013J 28400 F Scotland HE Qualification 20-40% 35<= 0 60 N Pass
2 AAA 2013J 30268 F North Western Region A Level or Equivalent 20-40% 35<= 0 60 Y Withdrawn
3 AAA 2013J 31604 F South East Region A Level or Equivalent 40-60% 35<= 0 60 N Pass
4 AAA 2013J 32885 F West Midlands Region Lower Than A Level 40-60% 35> 0 60 N Pass
In [30]:
studentInfo_data = studentInfo_data.merge(regdata.loc[:
,['id_student','code_module','date_unregistration']], how='inner',
on=['id_student','code_module'])
In [31]:
studentInfo_data['final_result'] = np.where(~studentInfo_data['date_unregistration'].isnull(), 'Withdrawn', studentInfo_data['final_result'])
In [32]:
print(studentInfo_data.head())
  code_module code_presentation  id_student gender                region  \
0         AAA             2013J       11391      M   East Anglian Region   
1         AAA             2013J       28400      F              Scotland   
2         AAA             2013J       30268      F  North Western Region   
3         AAA             2013J       31604      F     South East Region   
4         AAA             2013J       32885      F  West Midlands Region   

       highest_education imd_band age_band  num_of_prev_attempts  \
0       HE Qualification  80-100%     35<=                     0   
1       HE Qualification   20-40%     35<=                     0   
2  A Level or Equivalent   20-40%     35<=                     0   
3  A Level or Equivalent   40-60%     35<=                     0   
4     Lower Than A Level   40-60%      35>                     0   

   studied_credits disability final_result  date_unregistration  
0              240          N         Pass                  NaN  
1               60          N         Pass                  NaN  
2               60          Y    Withdrawn                 12.0  
3               60          N         Pass                  NaN  
4               60          N         Pass                  NaN  

TASK2: Categorize students based on the day they registered for a course. In other words, you need to bin the registration data based on the date_registration column. Just to illustrate this idea, you can group students into categories such as "Very early birds", "early birds", "in-time", and "late-comers". You can use the categories given in this example or create your own categories.

In [33]:
birds = ["Very early birds", "Early birds", "in-time", "late-comers"]
regdata.date_registration = pd.qcut(regdata.date_registration, 4, labels=birds)
In [34]:
studentInfo_data = studentInfo_data.merge(regdata.loc[:
,['id_student','code_module','date_registration']], how='inner',
on=['id_student','code_module'])
In [35]:
studentInfo_data.head()
Out[35]:
code_module code_presentation id_student gender region highest_education imd_band age_band num_of_prev_attempts studied_credits disability final_result date_unregistration date_registration
0 AAA 2013J 11391 M East Anglian Region HE Qualification 80-100% 35<= 0 240 N Pass NaN Very early birds
1 AAA 2013J 28400 F Scotland HE Qualification 20-40% 35<= 0 60 N Pass NaN in-time
2 AAA 2013J 30268 F North Western Region A Level or Equivalent 20-40% 35<= 0 60 Y Withdrawn 12.0 Early birds
3 AAA 2013J 31604 F South East Region A Level or Equivalent 40-60% 35<= 0 60 N Pass NaN in-time
4 AAA 2013J 32885 F West Midlands Region Lower Than A Level 40-60% 35> 0 60 N Pass NaN Very early birds

TASK3: Categorize students based on the day they unregistered a course. In other words, you need to bin registration date based on the date_unregistration column. You are free to determine the number and the name of the categories (as in Task1).

In [36]:
unreg =  studentInfo_data.dropna(subset=['date_unregistration'])
In [37]:
unreg.loc[unreg.date_unregistration < 0] = math.nan
In [38]:
categ = ["first_half", "second_half"]
new_group = pd.cut(unreg.date_unregistration, 2, labels=categ)
In [39]:
unreg =  unreg.assign(date_unregistration = new_group)
In [40]:
studentInfo_data = studentInfo_data.assign(date_unregistration = pd.cut(studentInfo_data.date_unregistration, 2,
labels=["first half", "second half"]))
In [41]:
px.bar(unreg['date_unregistration'].value_counts())

There is a huge difference between the two halves: far more students unregister in the first half than in the second.

TASK4: Choose THREE variables from demographic data (studentInfo.csv), and explore if there is some relationship between students' registration/unregistration behaviour and the chosen demographic variables (e.g., did students from HE registered early? did male students unregistered sooner than female students?). You are free in exploring the data to answer similar questions that you determine. If you find no relationship, this is totally fine. Just remember that your analysis should be accompanied with meaningful interpretations.

In [42]:
unreg.head()
Out[42]:
code_module code_presentation id_student gender region highest_education imd_band age_band num_of_prev_attempts studied_credits disability final_result date_unregistration date_registration
2 AAA 2013J 30268.0 F North Western Region A Level or Equivalent 20-40% 35<= 0.0 60.0 Y Withdrawn first_half Early birds
14 AAA 2013J 65002.0 F East Anglian Region A Level or Equivalent 60-80% 35> 0.0 60.0 N Withdrawn first_half Very early birds
15 AAA 2013J 65002.0 F East Anglian Region A Level or Equivalent 60-80% 35> 0.0 60.0 N Withdrawn first_half Very early birds
18 AAA 2014J 65002.0 F East Anglian Region A Level or Equivalent 60-80% 35> 1.0 60.0 N Withdrawn first_half Very early birds
19 AAA 2014J 65002.0 F East Anglian Region A Level or Equivalent 60-80% 35> 1.0 60.0 N Withdrawn first_half Very early birds
In [43]:
sns.catplot(data=studentInfo_data, x='highest_education', row='gender', kind="count",
hue='age_band', col='date_unregistration')
C:\Users\sidar\anaconda3\Lib\site-packages\seaborn\axisgrid.py:118: UserWarning:

The figure layout has changed to tight

Out[43]:
<seaborn.axisgrid.FacetGrid at 0x1aa31eef810>

Looking at these plots, many observations can be made; for example, the unregistration rate is higher among men than among women, and higher among people aged 35 and over than among those under 35.

1.4. Course Components Table¶

Course components table (moodle.csv) contains information about the available materials in the Moodle LMS. Typically these are html pages, pdf files, etc. Students have access to these materials online and their interactions with the materials are recorded. The moodle.csv file contains the following columns:

  • id_site – an identification number of the material.
  • code_module – an identification code for module.
  • code_presentation - the identification code of presentation.
  • activity_type – the role associated with the module material.
  • week_from – the week from which the material is planned to be used.
  • week_to – week until which the material is planned to be used.

TASK1: In this dataset, some columns contain mainly missing values. Detect them and drop them to save space in the memory.

In [44]:
course_comp = pd.read_csv("moodle.csv")
print(course_comp.head())
   id_site code_module code_presentation activity_type  week_from  week_to
0   546943         AAA             2013J      resource        NaN      NaN
1   546712         AAA             2013J     oucontent        NaN      NaN
2   546998         AAA             2013J      resource        NaN      NaN
3   546888         AAA             2013J           url        NaN      NaN
4   547035         AAA             2013J      resource        NaN      NaN
In [45]:
course_comp.isna().sum()
Out[45]:
id_site                 0
code_module             0
code_presentation       0
activity_type           0
week_from            5243
week_to              5243
dtype: int64

week_from and week_to each have 5243 missing values.
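Note that the task asks to detect and drop the mostly-missing *columns*, whereas the next cell drops the rows with missing weeks instead. A sketch of the column-dropping reading, detecting columns where more than half the values are missing (toy data, threshold of 0.5 is an assumption):

```python
import numpy as np
import pandas as pd

# Toy moodle-like frame; week_from/week_to are mostly missing
df = pd.DataFrame({
    "id_site": [1, 2, 3, 4],
    "activity_type": ["resource", "url", "resource", "quiz"],
    "week_from": [np.nan, np.nan, np.nan, 2.0],
    "week_to": [np.nan, np.nan, np.nan, 2.0],
})

# Columns whose share of missing values exceeds 50%
mostly_missing = df.columns[df.isna().mean() > 0.5]
df = df.drop(columns=mostly_missing)
print(list(df.columns))  # ['id_site', 'activity_type']
```

Dropping rows (as below) keeps only components with planned usage weeks, which also shrinks the table but changes which components survive into TASK2; either approach should be stated explicitly.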

In [46]:
course_comp.dropna(subset=['week_from', 'week_to'], inplace=True)
course_comp.isna().sum()
Out[46]:
id_site              0
code_module          0
code_presentation    0
activity_type        0
week_from            0
week_to              0
dtype: int64
In [47]:
course_comp.head()
Out[47]:
id_site code_module code_presentation activity_type week_from week_to
113 546732 AAA 2013J oucontent 2.0 2.0
198 546719 AAA 2013J oucontent 1.0 1.0
210 546681 AAA 2013J oucontent 1.0 1.0
264 877040 AAA 2014J oucontent 2.0 2.0
323 877045 AAA 2014J oucontent 1.0 1.0

TASK2: First identify the top 5 popular course component (activity_type) across all courses. Then, create a new table that displays how many times each of these popular components were included in each offering (code_presentation) of each course (code_module). Briefly interpret this table.

In [48]:
course_comp.activity_type.value_counts()
Out[48]:
activity_type
oucontent         381
url               352
resource          148
subpage           135
questionnaire      42
page               33
quiz               13
ouwiki              6
dualpane            5
repeatactivity      4
dataplus            2
Name: count, dtype: int64

oucontent, url, resource, subpage, and questionnaire are the most popular activity types.
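Rather than hard-coding the list (as in the next cell), the top 5 can be derived programmatically from the counts, which keeps the analysis correct if the data changes. A minimal sketch on a toy activity column (counts illustrative and made distinct to keep the ordering unambiguous):

```python
import pandas as pd

# Toy activity_type column with distinct counts per type
activity = pd.Series(["oucontent"] * 6 + ["url"] * 5 + ["resource"] * 4
                     + ["subpage"] * 3 + ["questionnaire"] * 2 + ["quiz"])

# value_counts() sorts descending, so the head is the top-5 list
popular_ac = activity.value_counts().head(5).index.tolist()
print(popular_ac)  # ['oucontent', 'url', 'resource', 'subpage', 'questionnaire']
```

On the real course_comp table, `course_comp["activity_type"].value_counts().head(5).index.tolist()` would yield the same five types as the hard-coded list.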

In [49]:
popular_ac = ["oucontent",
"url",
"resource",
"subpage",
"questionnaire"]
In [50]:
filtered_course_comp = course_comp[course_comp['activity_type'].isin(popular_ac)]
In [51]:
component_counts = filtered_course_comp.groupby(['code_module', 'code_presentation', 'activity_type']).size().unstack(fill_value=0)
In [52]:
print(component_counts.head())
activity_type                  oucontent  questionnaire  resource  subpage  \
code_module code_presentation                                                
AAA         2013J                      3              0         0        0   
            2014J                      3              0         0        0   
BBB         2013J                      1              0        40       24   
            2014B                      1              0        40       24   
            2014J                     24              0         8        0   

activity_type                  url  
code_module code_presentation       
AAA         2013J                0  
            2014J                0  
BBB         2013J               11  
            2014B               11  
            2014J                1  
In [53]:
component_counts = component_counts.reset_index()
In [54]:
melted_data = pd.melt(component_counts, id_vars=['code_module', 'code_presentation'], var_name='activity_type', value_name='count')
In [55]:
sns.catplot(x='activity_type', y='count', hue='code_presentation', col='code_module', kind='bar', data=melted_data)
C:\Users\sidar\anaconda3\Lib\site-packages\seaborn\axisgrid.py:118: UserWarning:

The figure layout has changed to tight

Out[55]:
<seaborn.axisgrid.FacetGrid at 0x1aa35637d90>

The popular activity types are concentrated mostly in the FFF and BBB modules.

1.5. Student Activity Data¶

Student activity data (studentMoodleInteract.csv) contains information about each student’s interactions with the materials in the VLE. This file contains the following columns:

  • code_module – an identification code for a module.
  • code_presentation - the identification code of the module presentation.
  • id_student – a unique identification number for the student.
  • id_site - an identification number for the course material/component.
  • date – the date of student’s interaction with the material measured as the number of days since the start of the module-presentation.
  • sum_click – the number of times a student interacts with the material in that day.
In [56]:
studac_data =  pd.read_csv("studentMoodleInteract.csv")
studac_data.head()
Out[56]:
code_module code_presentation id_student id_site date sum_click
0 AAA 2013J 28400 546652 -10 4
1 AAA 2013J 28400 546652 -10 1
2 AAA 2013J 28400 546652 -10 1
3 AAA 2013J 28400 546614 -10 11
4 AAA 2013J 28400 546714 -10 1

TASK1: Display the total number of clicks for each course per each semester delivered. Besides a textual output, some visualizations must be provided for helping to interpret the data.

In [57]:
clicks = studac_data.groupby(['code_module', 'code_presentation'])['sum_click'].sum()
clicks = clicks.fillna(0)
print(clicks)
code_module  code_presentation
AAA          2013J                 648494
             2014J                 598158
BBB          2013B                1347911
             2013J                1378656
             2014B                 833865
             2014J                1673744
CCC          2014B                1889177
             2014J                2792972
DDD          2013B                1387467
             2013J                1757214
             2014B                 947657
             2014J                1437751
EEE          2013J                1535953
             2014B                 832038
             2014J                1616236
FFF          2013B                4220055
             2013J                5116439
             2014B                2975619
             2014J                5281809
GGG          2013J                 509091
             2014B                 425165
             2014J                 399628
Name: sum_click, dtype: int64
In [59]:
clicks_df = clicks.reset_index()
clicks_df
Out[59]:
code_module code_presentation sum_click
0 AAA 2013J 648494
1 AAA 2014J 598158
2 BBB 2013B 1347911
3 BBB 2013J 1378656
4 BBB 2014B 833865
5 BBB 2014J 1673744
6 CCC 2014B 1889177
7 CCC 2014J 2792972
8 DDD 2013B 1387467
9 DDD 2013J 1757214
10 DDD 2014B 947657
11 DDD 2014J 1437751
12 EEE 2013J 1535953
13 EEE 2014B 832038
14 EEE 2014J 1616236
15 FFF 2013B 4220055
16 FFF 2013J 5116439
17 FFF 2014B 2975619
18 FFF 2014J 5281809
19 GGG 2013J 509091
20 GGG 2014B 425165
21 GGG 2014J 399628
In [60]:
sns.barplot(x='code_module', y='sum_click', hue='code_presentation', data=clicks_df)
Out[60]:
<Axes: xlabel='code_module', ylabel='sum_click'>

The FFF course has by far the most clicks, peaking in the 2014J presentation.

TASK2: As a follow up to the first task, identify the courses in which the total number of clicks is higher in 2014 than 2013. If the course was taught two times in the same year (such as, 2013B and 2013J) use the average of both semesters ((2013B+2013J)/2) to compare with the other year.

In [61]:
clicks.index.name = None
In [62]:
clicks_pivot = clicks.unstack(level='code_presentation')
clicks_pivot = clicks_pivot.fillna(0)
print(clicks_pivot)
code_presentation      2013B      2013J      2014B      2014J
code_module                                                  
AAA                      0.0   648494.0        0.0   598158.0
BBB                1347911.0  1378656.0   833865.0  1673744.0
CCC                      0.0        0.0  1889177.0  2792972.0
DDD                1387467.0  1757214.0   947657.0  1437751.0
EEE                      0.0  1535953.0   832038.0  1616236.0
FFF                4220055.0  5116439.0  2975619.0  5281809.0
GGG                      0.0   509091.0   425165.0   399628.0
In [63]:
result_df = pd.DataFrame(index=clicks_pivot.index)
both_semesters_mask_2013 = (clicks_pivot['2013B'] > 0) & (clicks_pivot['2013J'] > 0)
total_clicks_2013 = clicks_pivot[['2013B', '2013J']].sum(axis=1)
total_clicks_2013[both_semesters_mask_2013] /= 2
both_semesters_mask_2014 = (clicks_pivot['2014B'] > 0) & (clicks_pivot['2014J'] > 0)
total_clicks_2014 = clicks_pivot[['2014B', '2014J']].sum(axis=1)
total_clicks_2014[both_semesters_mask_2014] /= 2
result_df['2013'] = total_clicks_2013
result_df['2014'] = total_clicks_2014
result_df['Higher_in_2014'] = total_clicks_2014 > total_clicks_2013
print(result_df)
                  2013       2014  Higher_in_2014
code_module                                      
AAA           648494.0   598158.0           False
BBB          1363283.5  1253804.5           False
CCC                0.0  2341074.5            True
DDD          1572340.5  1192704.0           False
EEE          1535953.0  1224137.0           False
FFF          4668247.0  4128714.0           False
GGG           509091.0   412396.5           False

Except for the CCC course (which only ran in 2014), total clicks are higher in 2013 than in 2014.

TASK3: Which type of resources were mostly clicked by the students? Do you observe a common pattern accross courses (e.g., in almost all courses, clicks on resource is higher than quiz)? A heatmap as a visualization might be helpful here.

In [64]:
course_comp2 = pd.read_csv("moodle.csv")
In [65]:
course_comp2 = pd.DataFrame(course_comp2)
In [66]:
course_comp2.head()
Out[66]:
id_site code_module code_presentation activity_type week_from week_to
0 546943 AAA 2013J resource NaN NaN
1 546712 AAA 2013J oucontent NaN NaN
2 546998 AAA 2013J resource NaN NaN
3 546888 AAA 2013J url NaN NaN
4 547035 AAA 2013J resource NaN NaN
In [67]:
studac_data2 =  pd.read_csv("studentMoodleInteract.csv")
In [68]:
merge_data2 = course_comp2.merge(studac_data2.groupby(['id_site', 'code_module'])[['sum_click']].sum().reset_index(),
how='inner', on=['code_module', 'id_site'])
In [69]:
merge_data2.head()
Out[69]:
id_site code_module code_presentation activity_type week_from week_to sum_click
0 546943 AAA 2013J resource NaN NaN 36
1 546712 AAA 2013J oucontent NaN NaN 527
2 546998 AAA 2013J resource NaN NaN 6
3 546888 AAA 2013J url NaN NaN 15
4 547035 AAA 2013J resource NaN NaN 28
In [70]:
pivot_table2 = merge_data2.pivot_table(index='code_module', columns="activity_type",
values='sum_click', aggfunc='sum')
In [71]:
pivot_table2
Out[71]:
activity_type dataplus dualpane externalquiz folder forumng glossary homepage htmlactivity oucollaborate oucontent ouelluminate ouwiki page questionnaire quiz repeatactivity resource sharedsubpage subpage url
code_module
AAA 3398.0 NaN NaN NaN 337103.0 656.0 266720.0 NaN 498.0 523394.0 NaN NaN NaN NaN NaN NaN 14924.0 NaN 71582.0 28377.0
BBB NaN NaN NaN NaN 2109720.0 9820.0 1136792.0 NaN 12556.0 921260.0 1433.0 NaN NaN 6730.0 462995.0 NaN 213605.0 171.0 273600.0 85494.0
CCC NaN NaN NaN NaN 643094.0 NaN 756032.0 NaN 9822.0 516502.0 NaN NaN 4188.0 NaN 2164110.0 NaN 217846.0 NaN 342574.0 27981.0
DDD NaN NaN 64292.0 NaN 1493375.0 63336.0 1433206.0 NaN 53390.0 686331.0 13565.0 275867.0 1536.0 NaN NaN NaN 301644.0 NaN 979982.0 163565.0
EEE NaN 4871.0 NaN NaN 792656.0 NaN 654525.0 NaN 6920.0 1514727.0 NaN 457939.0 879.0 NaN 322772.0 NaN 52423.0 NaN 83270.0 93245.0
FFF 44070.0 15845.0 NaN 5420.0 2358160.0 6142.0 2418335.0 9239.0 25788.0 6567216.0 24030.0 160706.0 57028.0 58034.0 3831026.0 9.0 237075.0 NaN 1607759.0 168040.0
GGG NaN NaN NaN NaN 239282.0 8008.0 283454.0 NaN NaN 477373.0 NaN NaN NaN NaN 200337.0 NaN 72615.0 NaN 52815.0 NaN
In [ ]:
pivot_table2.plot(kind='bar').legend(bbox_to_anchor =(1, 1))
Out[ ]:
<matplotlib.legend.Legend at 0x790d6109db10>

Looking at the bar plot across code_modules, oucontent and forumng generally receive the most clicks: quiz is the most-clicked type in CCC, forumng in BBB and DDD, and oucontent in AAA, EEE, FFF, and GGG.
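TASK3 also suggests a heatmap. As a minimal sketch, using a small synthetic stand-in for pivot_table2 (values taken from the AAA and BBB rows above), each row can be normalized so modules with different overall activity become comparable:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small stand-in for pivot_table2 (module x activity_type click sums)
pivot = pd.DataFrame(
    {"forumng": [337103.0, 2109720.0],
     "oucontent": [523394.0, 921260.0],
     "quiz": [0.0, 462995.0]},
    index=pd.Index(["AAA", "BBB"], name="code_module"),
)

# Normalize each row to the share of that module's total clicks
share = pivot.div(pivot.sum(axis=1), axis=0)

fig, ax = plt.subplots(figsize=(6, 3))
sns.heatmap(share, annot=True, fmt=".2f", cmap="viridis", ax=ax)
ax.set_title("Share of clicks per activity type")
```

The same two lines (`div` by row sums, then `sns.heatmap`) would apply to the full pivot_table2.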

TASK4: For each student, compute the total number of clicks on each course component type (activity_type column in moodle.csv), separately for each course and semester. A simple representation of the expected table is provided below with fake data (note that in the given example the columns and rows are incomplete).

Student Id code_module code_presentation PDF Assignment
1234 AAA 2013J 23 33
1234 BBB 2014B 5 42

Note that, in this task you actually create some features that can be used for predictive modeling.

In [72]:
ac_mod = studac_data2.merge(course_comp2, how='outer', on=['id_site', 'code_module','code_presentation'])
In [73]:
ac_mod = ac_mod.groupby(['id_student', 'id_site', 'code_module', 'code_presentation','activity_type'])['sum_click'].sum()
In [74]:
ac_mod = ac_mod.reset_index()
In [75]:
ac_mod = ac_mod.pivot_table(columns='activity_type', index=['id_student','code_module', 'code_presentation'], values='sum_click', aggfunc='sum')
In [76]:
ac_mod.columns.name = None
ac_mod.index.name = None
In [77]:
ac_mod = ac_mod.reset_index().fillna(0)
In [78]:
ac_mod.head()
Out[78]:
id_student code_module code_presentation dataplus dualpane externalquiz folder forumng glossary homepage ... ouelluminate ouwiki page questionnaire quiz repeatactivity resource sharedsubpage subpage url
0 6516.0 AAA 2014J 21.0 0.0 0.0 0.0 451.0 0.0 497.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 31.0 0.0 143.0 143.0
1 8462.0 DDD 2013J 0.0 0.0 12.0 0.0 36.0 0.0 184.0 ... 0.0 18.0 0.0 0.0 0.0 0.0 70.0 0.0 227.0 23.0
2 8462.0 DDD 2014J 0.0 0.0 0.0 0.0 2.0 0.0 7.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 11391.0 AAA 2013J 0.0 0.0 0.0 0.0 193.0 0.0 138.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 13.0 0.0 32.0 5.0
4 23629.0 BBB 2013B 0.0 0.0 0.0 0.0 87.0 0.0 36.0 ... 0.0 0.0 0.0 0.0 31.0 0.0 2.0 0.0 5.0 0.0

5 rows × 23 columns

TASK5: Using proper visualizations and statistical analysis, please explore if there is any relationship between students' course performance (final_result column in studentInfo.csv) and clicks on different resources.

In [79]:
studentInfo_data.head()
Out[79]:
code_module code_presentation id_student gender region highest_education imd_band age_band num_of_prev_attempts studied_credits disability final_result date_unregistration date_registration
0 AAA 2013J 11391 M East Anglian Region HE Qualification 80-100% 35<= 0 240 N Pass NaN Very early birds
1 AAA 2013J 28400 F Scotland HE Qualification 20-40% 35<= 0 60 N Pass NaN in-time
2 AAA 2013J 30268 F North Western Region A Level or Equivalent 20-40% 35<= 0 60 Y Withdrawn first half Early birds
3 AAA 2013J 31604 F South East Region A Level or Equivalent 40-60% 35<= 0 60 N Pass NaN in-time
4 AAA 2013J 32885 F West Midlands Region Lower Than A Level 40-60% 35> 0 60 N Pass NaN Very early birds
In [80]:
merged_data5 = ac_mod.merge(studentInfo_data.loc[:,['id_student', 'code_module', 'code_presentation','final_result']], how='inner', on=['id_student', 'code_module','code_presentation']).drop_duplicates()
merged_data5.head()
Out[80]:
id_student code_module code_presentation dataplus dualpane externalquiz folder forumng glossary homepage ... ouwiki page questionnaire quiz repeatactivity resource sharedsubpage subpage url final_result
0 6516.0 AAA 2014J 21.0 0.0 0.0 0.0 451.0 0.0 497.0 ... 0.0 0.0 0.0 0.0 0.0 31.0 0.0 143.0 143.0 Pass
1 8462.0 DDD 2013J 0.0 0.0 12.0 0.0 36.0 0.0 184.0 ... 18.0 0.0 0.0 0.0 0.0 70.0 0.0 227.0 23.0 Withdrawn
5 8462.0 DDD 2014J 0.0 0.0 0.0 0.0 2.0 0.0 7.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Withdrawn
9 11391.0 AAA 2013J 0.0 0.0 0.0 0.0 193.0 0.0 138.0 ... 0.0 0.0 0.0 0.0 0.0 13.0 0.0 32.0 5.0 Pass
10 23629.0 BBB 2013B 0.0 0.0 0.0 0.0 87.0 0.0 36.0 ... 0.0 0.0 0.0 31.0 0.0 2.0 0.0 5.0 0.0 Fail

5 rows × 24 columns

In [81]:
resource_columns = ['dataplus', 'dualpane', 'externalquiz', 'folder', 'forumng', 'glossary', 'homepage',
                     'htmlactivity', 'oucollaborate', 'oucontent', 'ouelluminate', 'ouwiki',
                     'page', 'questionnaire', 'quiz', 'repeatactivity', 'resource',
                     'sharedsubpage', 'subpage', 'url']
In [82]:
melted222_data = merged_data5.melt(id_vars=['final_result'], value_vars=resource_columns, var_name='resource_type', value_name='clicks')
In [83]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 20))
for i, resource_type in enumerate(resource_columns, 1):
    plt.subplot(7, 3, i)  # 20 resource types fit into a 7x3 grid
    sns.boxplot(x='final_result', y=resource_type, data=merged_data5)
plt.tight_layout()

According to the box plots, oucontent, quiz, and homepage appear to be related to final_result.
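The task also asks for statistical analysis. A minimal sketch on synthetic data (hypothetical Poisson click counts, not the real dataset): since click counts are heavily skewed, a rank-based Kruskal-Wallis test is a reasonable complement to the box plots. On the real data, each resource column of merged_data5 could be grouped by final_result in the same way.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic skewed click counts for three outcome groups
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "final_result": np.repeat(["Fail", "Pass", "Distinction"], 50),
    "quiz": np.concatenate([rng.poisson(20, 50),
                            rng.poisson(60, 50),
                            rng.poisson(120, 50)]),
})

# Rank-based test: do quiz clicks differ across final_result groups?
groups = [g["quiz"].to_numpy() for _, g in toy.groupby("final_result")]
stat, p = stats.kruskal(*groups)
```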

2. Predictive Modeling¶

In this section, you will build a machine learning model to predict students' final course outcome (final_result column in studentInfo.csv), i.e., whether a student's result is 'Pass', 'Withdrawn', 'Fail', or 'Distinction'. If you consider the number of students in some of these categories too few, you can combine them into a new category.

2.1. Generate Features from Demographic Information¶

In Section 1.2, you explored demographic data about students and tuned some categorical variables. From these categorical variables, please generate at least 10 dummy variables to be used for predictors in the machine learning model.

In [84]:
studentInfo_data.head()
Out[84]:
code_module code_presentation id_student gender region highest_education imd_band age_band num_of_prev_attempts studied_credits disability final_result date_unregistration date_registration
0 AAA 2013J 11391 M East Anglian Region HE Qualification 80-100% 35<= 0 240 N Pass NaN Very early birds
1 AAA 2013J 28400 F Scotland HE Qualification 20-40% 35<= 0 60 N Pass NaN in-time
2 AAA 2013J 30268 F North Western Region A Level or Equivalent 20-40% 35<= 0 60 Y Withdrawn first half Early birds
3 AAA 2013J 31604 F South East Region A Level or Equivalent 40-60% 35<= 0 60 N Pass NaN in-time
4 AAA 2013J 32885 F West Midlands Region Lower Than A Level 40-60% 35> 0 60 N Pass NaN Very early birds
In [85]:
mydummy = pd.get_dummies(studentInfo_data, columns=['gender', 'highest_education', 'region','imd_band', 'age_band', 'disability', 'date_registration'],drop_first=True)
In [86]:
mydummy.head()
Out[86]:
code_module code_presentation id_student num_of_prev_attempts studied_credits final_result date_unregistration gender_M highest_education_HE Qualification highest_education_Lower Than A Level ... region_Yorkshire Region imd_band_20-40% imd_band_40-60% imd_band_60-80% imd_band_80-100% age_band_35> disability_Y date_registration_Early birds date_registration_in-time date_registration_late-comers
0 AAA 2013J 11391 0 240 Pass NaN True True False ... False False False False True False False False False False
1 AAA 2013J 28400 0 60 Pass NaN False True False ... False True False False False False False False True False
2 AAA 2013J 30268 0 60 Withdrawn first half False False False ... False True False False False False True True False False
3 AAA 2013J 31604 0 60 Pass NaN False False False ... False False True False False False False False True False
4 AAA 2013J 32885 0 60 Pass NaN False False True ... False False True False False True False False False False

5 rows × 31 columns

I converted the categorical columns into dummy variables.
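As a minimal sketch (toy frame with hypothetical values), pd.get_dummies with drop_first=True drops the first level of each variable, which then serves as the implicit baseline:

```python
import pandas as pd

toy = pd.DataFrame({"gender": ["M", "F", "F"], "disability": ["N", "Y", "N"]})
# "F" and "N" are dropped as baselines, leaving gender_M and disability_Y
dummies = pd.get_dummies(toy, columns=["gender", "disability"], drop_first=True)
```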

2.2. Generate/Select Features from Click Data¶

In Section 1.5, you have already created some features from students' click behaviour. You can use all of them here as additional predictors.

Additionally, you should create at least 3 features indicating the engagement level of students at different course components. Some example features are provided below:

  • a dummy variable that indicates if students clicked at least three types of course components or not,
  • each student's average number of clicks across all components per a single course and semester,
  • a dummy variable indicating if students clicked all types of course components.

There is no limit on the type and number of additional features you can generate from the click data.

In [87]:
ac_mod = ac_mod.merge(studentInfo_data.loc[:,['id_student', 'code_module', 'code_presentation','final_result']], how='inner', on=['id_student', 'code_module','code_presentation']).drop_duplicates()
In [88]:
mydummy = mydummy.merge(ac_mod.iloc[:,:-1], how='inner', on=['id_student', 'code_module','code_presentation'])
In [89]:
mydummy.head()
Out[89]:
code_module code_presentation id_student num_of_prev_attempts studied_credits final_result date_unregistration gender_M highest_education_HE Qualification highest_education_Lower Than A Level ... ouelluminate ouwiki page questionnaire quiz repeatactivity resource sharedsubpage subpage url
0 AAA 2013J 11391 0 240 Pass NaN True True False ... 0.0 0.0 0.0 0.0 0.0 0.0 13.0 0.0 32.0 5.0
1 AAA 2013J 28400 0 60 Pass NaN False True False ... 0.0 0.0 0.0 0.0 0.0 0.0 12.0 0.0 87.0 48.0
2 AAA 2013J 30268 0 60 Withdrawn first half False False False ... 0.0 0.0 0.0 0.0 0.0 0.0 4.0 0.0 22.0 4.0
3 AAA 2013J 31604 0 60 Pass NaN False False False ... 0.0 0.0 0.0 0.0 0.0 0.0 19.0 0.0 144.0 90.0
4 AAA 2013J 32885 0 60 Pass NaN False False True ... 0.0 0.0 0.0 0.0 0.0 0.0 45.0 0.0 79.0 14.0

5 rows × 51 columns

In [90]:
mydummy.columns
Out[90]:
Index(['code_module', 'code_presentation', 'id_student',
       'num_of_prev_attempts', 'studied_credits', 'final_result',
       'date_unregistration', 'gender_M', 'highest_education_HE Qualification',
       'highest_education_Lower Than A Level', 'region_East Midlands Region',
       'region_Ireland', 'region_London Region', 'region_North Region',
       'region_North Western Region', 'region_Scotland',
       'region_South East Region', 'region_South Region',
       'region_South West Region', 'region_Wales',
       'region_West Midlands Region', 'region_Yorkshire Region',
       'imd_band_20-40%', 'imd_band_40-60%', 'imd_band_60-80%',
       'imd_band_80-100%', 'age_band_35>', 'disability_Y',
       'date_registration_Early birds', 'date_registration_in-time',
       'date_registration_late-comers', 'dataplus', 'dualpane', 'externalquiz',
       'folder', 'forumng', 'glossary', 'homepage', 'htmlactivity',
       'oucollaborate', 'oucontent', 'ouelluminate', 'ouwiki', 'page',
       'questionnaire', 'quiz', 'repeatactivity', 'resource', 'sharedsubpage',
       'subpage', 'url'],
      dtype='object')
In [91]:
mydummy['engagement_level_3types'] = mydummy[['dataplus', 'dualpane', 'externalquiz', 'folder', 'forumng',
                                             'glossary', 'homepage', 'htmlactivity', 'oucollaborate',
                                             'oucontent', 'ouelluminate', 'ouwiki', 'page', 'questionnaire',
                                             'quiz', 'repeatactivity', 'resource', 'sharedsubpage', 'subpage',
                                             'url']].gt(0).sum(axis=1) >= 3
In [92]:
print(mydummy['engagement_level_3types'].value_counts())
engagement_level_3types
True     36793
False     1080
Name: count, dtype: int64

The column has two values: 36,793 rows are True, meaning the student clicked at least three types of course components, while 1,080 rows are False.
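The feature counts, per row, how many component columns have at least one click. A toy check of the gt(0).sum(axis=1) pattern (hypothetical columns):

```python
import pandas as pd

toy = pd.DataFrame({"quiz": [5, 0], "forumng": [2, 1], "url": [1, 0]})
clicked_types = toy.gt(0).sum(axis=1)  # number of component types clicked
at_least_three = clicked_types >= 3
```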

In [93]:
mydummy['avg_clicks_per_semester'] = mydummy[['dataplus', 'dualpane', 'externalquiz', 'folder', 'forumng',
                                              'glossary', 'homepage', 'htmlactivity', 'oucollaborate',
                                              'oucontent', 'ouelluminate', 'ouwiki', 'page', 'questionnaire',
                                              'quiz', 'repeatactivity', 'resource', 'sharedsubpage', 'subpage',
                                              'url']].mean(axis=1)
In [94]:
print(mydummy[['avg_clicks_per_semester']])
       avg_clicks_per_semester
0                        46.70
1                        71.75
2                        14.05
3                       107.90
4                        51.70
...                        ...
37868                     2.05
37869                    44.65
37870                    15.60
37871                    13.75
37872                    30.80

[37873 rows x 1 columns]

The new avg_clicks_per_semester column holds the average number of clicks across all components for each student in a single course and semester.

In [95]:
print(mydummy['avg_clicks_per_semester'].describe())
count    37873.000000
mean        59.981657
std         80.353663
min          0.050000
25%         10.500000
50%         31.400000
75%         76.500000
max       1206.950000
Name: avg_clicks_per_semester, dtype: float64

This provides summary statistics for avg_clicks_per_semester: the mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum.

In [96]:
mydummy['engagement_level_alltypes'] = mydummy[['dataplus', 'dualpane', 'externalquiz', 'folder', 'forumng',
                                                'glossary', 'homepage', 'htmlactivity', 'oucollaborate',
                                                'oucontent', 'ouelluminate', 'ouwiki', 'page', 'questionnaire',
                                                'quiz', 'repeatactivity', 'resource', 'sharedsubpage', 'subpage',
                                                'url']].gt(0).all(axis=1)
In [97]:
mydummy['engagement_level_alltypes'].unique()
Out[97]:
array([False])

No student clicked every type of course component, so engagement_level_alltypes is constant (always False) and carries no information.

In [104]:
mydummy.drop(columns='engagement_level_alltypes', inplace=True)
In [105]:
mydummy = pd.get_dummies(mydummy.drop(columns=['date_unregistration',
'id_student','dataplus', 'dualpane', 'externalquiz',
'folder', 'forumng', 'glossary', 'homepage', 'htmlactivity',
'oucollaborate', 'oucontent', 'ouelluminate', 'ouwiki',
'page','questionnaire', 'quiz', 'repeatactivity', 'resource',
'sharedsubpage','subpage', 'url']), columns=['code_module',
'code_presentation'], drop_first=True)
In [106]:
pd.set_option('display.max_columns', None)
mydummy.head()
Out[106]:
num_of_prev_attempts studied_credits final_result gender_M highest_education_HE Qualification highest_education_Lower Than A Level region_East Midlands Region region_Ireland region_London Region region_North Region region_North Western Region region_Scotland region_South East Region region_South Region region_South West Region region_Wales region_West Midlands Region region_Yorkshire Region imd_band_20-40% imd_band_40-60% imd_band_60-80% imd_band_80-100% age_band_35> disability_Y date_registration_Early birds date_registration_in-time date_registration_late-comers engagement_level_3types avg_clicks_per_semester code_module_BBB code_module_CCC code_module_DDD code_module_EEE code_module_FFF code_module_GGG code_presentation_2013J code_presentation_2014B code_presentation_2014J
0 0 240 Pass True True False False False False False False False False False False False False False False False False True False False False False False True 46.70 False False False False False False True False False
1 0 60 Pass False True False False False False False False True False False False False False False True False False False False False False True False True 71.75 False False False False False False True False False
2 0 60 Withdrawn False False False False False False False True False False False False False False False True False False False False True True False False True 14.05 False False False False False False True False False
3 0 60 Pass False False False False False False False False False True False False False False False False True False False False False False True False True 107.90 False False False False False False True False False
4 0 60 Pass False False True False False False False False False False False False False True False False True False False True False False False False True 51.70 False False False False False False True False False
In [107]:
mydummy['final_result'].unique()
Out[107]:
array(['Pass', 'Withdrawn', 'Fail', 'Distinction'], dtype=object)
In [110]:
mydummy['engagement_level_3types'] = mydummy['engagement_level_3types'].astype(int)
d = {'Pass': 1, 'Withdrawn': 2, 'Distinction': 3, 'Fail': 0}
mydummy['final_result'] = mydummy['final_result'].map(d)
In [111]:
mydummy.head()
Out[111]:
num_of_prev_attempts studied_credits final_result gender_M highest_education_HE Qualification highest_education_Lower Than A Level region_East Midlands Region region_Ireland region_London Region region_North Region region_North Western Region region_Scotland region_South East Region region_South Region region_South West Region region_Wales region_West Midlands Region region_Yorkshire Region imd_band_20-40% imd_band_40-60% imd_band_60-80% imd_band_80-100% age_band_35> disability_Y date_registration_Early birds date_registration_in-time date_registration_late-comers engagement_level_3types avg_clicks_per_semester code_module_BBB code_module_CCC code_module_DDD code_module_EEE code_module_FFF code_module_GGG code_presentation_2013J code_presentation_2014B code_presentation_2014J
0 0 240 1 True True False False False False False False False False False False False False False False False False True False False False False False 1 46.70 False False False False False False True False False
1 0 60 1 False True False False False False False False True False False False False False False True False False False False False False True False 1 71.75 False False False False False False True False False
2 0 60 2 False False False False False False False True False False False False False False False True False False False False True True False False 1 14.05 False False False False False False True False False
3 0 60 1 False False False False False False False False False True False False False False False False True False False False False False True False 1 107.90 False False False False False False True False False
4 0 60 1 False False True False False False False False False False False False False True False False True False False True False False False False 1 51.70 False False False False False False True False False
In [112]:
mydummy = mydummy[mydummy.final_result.isin([0, 1])]  # keep only Fail (0) and Pass (1)

2.3. Training and Testing the Model¶

As the last activity in this project, you are expected to train and test a logistic regression model for predicting students' final course status. You should use 10-fold cross-validation.

Interpret the results based on confusion matrix and AUC scores. In your interpretation, please also mention the features with high predictive power and those with low predictive power.

Please note that the achieving low/high accuracy in the predictions has no importance for your grade.

In [113]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
In [114]:
x_train, x_test, y_train, y_test = train_test_split(mydummy.
                                                    drop(columns='final_result'), mydummy.loc[:,'final_result'], shuffle=True,
                                                    train_size=.8)
In [115]:
model = LogisticRegression(max_iter=3000)
In [116]:
model.fit(X=x_train, y=y_train)
Out[116]:
LogisticRegression(max_iter=3000)
In [117]:
cross_val_score(model, x_test, y_test, cv=10)
Out[117]:
array([0.80288462, 0.75480769, 0.78846154, 0.77163462, 0.76923077,
       0.75721154, 0.77403846, 0.72836538, 0.78846154, 0.77884615])
In [118]:
y_pred = cross_val_predict(model, x_test, y_test, cv=10)
conf_mat = confusion_matrix(y_test, y_pred)
In [119]:
print(classification_report(y_test, y_pred, target_names=['failed', 'passed']))
              precision    recall  f1-score   support

      failed       0.74      0.62      0.68      1611
      passed       0.78      0.86      0.82      2549

    accuracy                           0.77      4160
   macro avg       0.76      0.74      0.75      4160
weighted avg       0.77      0.77      0.77      4160

In [120]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_pred)
Out[120]:
0.7443586767026127
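Note that the AUC above is computed from hard 0/1 predictions; ROC AUC is usually computed from predicted probabilities instead, which generally gives a higher and more informative score. A minimal sketch on synthetic data (make_classification stands in for the real feature matrix):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the real feature matrix and binary target
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = LogisticRegression(max_iter=3000)
# Out-of-fold probability of the positive class from 10-fold CV
proba = cross_val_predict(model, X, y, cv=10, method="predict_proba")[:, 1]
auc = roc_auc_score(y, proba)
```

On the real data, the same `method="predict_proba"` call could replace the hard-label predictions fed to roc_auc_score.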
In [121]:
for i,v in enumerate(model.coef_[0]):
    print(f'Feature: {x_train.columns[i]:<40} Score: {v:.5f}')
Feature: num_of_prev_attempts                     Score: -0.31264
Feature: studied_credits                          Score: 0.00066
Feature: gender_M                                 Score: 0.03422
Feature: highest_education_HE Qualification       Score: 0.09359
Feature: highest_education_Lower Than A Level     Score: -0.54654
Feature: region_East Midlands Region              Score: 0.18754
Feature: region_Ireland                           Score: 0.10683
Feature: region_London Region                     Score: -0.15509
Feature: region_North Region                      Score: -0.07396
Feature: region_North Western Region              Score: -0.19340
Feature: region_Scotland                          Score: -0.13420
Feature: region_South East Region                 Score: 0.19938
Feature: region_South Region                      Score: 0.14321
Feature: region_South West Region                 Score: 0.11194
Feature: region_Wales                             Score: -0.18568
Feature: region_West Midlands Region              Score: 0.09697
Feature: region_Yorkshire Region                  Score: -0.04849
Feature: imd_band_20-40%                          Score: 0.22244
Feature: imd_band_40-60%                          Score: 0.31824
Feature: imd_band_60-80%                          Score: 0.35537
Feature: imd_band_80-100%                         Score: 0.52841
Feature: age_band_35>                             Score: 0.22388
Feature: disability_Y                             Score: -0.18569
Feature: date_registration_Early birds            Score: -0.02331
Feature: date_registration_in-time                Score: -0.05832
Feature: date_registration_late-comers            Score: -0.08498
Feature: engagement_level_3types                  Score: 2.27893
Feature: avg_clicks_per_semester                  Score: 0.02883
Feature: code_module_BBB                          Score: 0.16872
Feature: code_module_CCC                          Score: -1.12619
Feature: code_module_DDD                          Score: -0.67909
Feature: code_module_EEE                          Score: -0.77596
Feature: code_module_FFF                          Score: -2.24652
Feature: code_module_GGG                          Score: 0.25861
Feature: code_presentation_2013J                  Score: 0.52654
Feature: code_presentation_2014B                  Score: 0.50116
Feature: code_presentation_2014J                  Score: 0.77358
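To read off high- versus low-predictive-power features, the coefficients can be ranked by absolute value. A minimal sketch using three of the printed values above (the full Series could be built from model.coef_[0] and x_train.columns):

```python
import pandas as pd

coefs = pd.Series({
    "engagement_level_3types": 2.27893,
    "code_module_FFF": -2.24652,
    "studied_credits": 0.00066,
})
# Order by coefficient magnitude, largest first
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
```

In this run, engagement_level_3types and the code_module_FFF and code_module_CCC dummies have the largest magnitudes, while studied_credits and the date_registration dummies are close to zero. Keep in mind that coefficient sizes are directly comparable only when the features are on similar scales.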
In [ ]: